RapidMiner Studio documentation, version 10.2
Process Documents
(Text Processing)
Synopsis
Generates word vectors from a text object.
Description
This operator takes a single TextObject as input and generates a term vector from it. The resulting example set therefore consists of exactly one example, which makes the operator especially useful for applying a model to a single text. Because the SingleTextInputOperator also provides a parameter for specifying the text directly, this operator is the more appropriate choice when the process is invoked from a program, where a TextObject can simply be constructed and passed to the process.
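The idea of turning one text into one example can be sketched outside RapidMiner. The following is a minimal Python illustration, not the operator's implementation: the helper name `term_vector`, the simple letter-based tokenization, and the raw term counts used as weights are all assumptions made for this example.

```python
import re
from collections import Counter


def term_vector(text, word_list):
    """Return one row of term counts for `text`, ordered like `word_list`.

    Hypothetical helper: tokenization and weighting are illustrative
    assumptions, not RapidMiner behavior.
    """
    # Lowercase the text and split it into runs of letters (a deliberately simple tokenizer).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # One value per word-list entry; words outside the list are ignored.
    return [counts[word] for word in word_list]


word_list = ["model", "text", "vector"]
row = term_vector("Apply the model to one text. One text, one vector.", word_list)
print(dict(zip(word_list, row)))
```

Because the columns come from a fixed word list, a vector built this way has the same attributes in the same order every time, which is what applying an existing model to a single text requires.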
Input
word list
The word list port.
documents (Collection)
The documents port.
Output
example set (Data Table)
The example set port.
word list
The word list port.
Parameters
- create_word_vector: If checked, the tokens of a document are used to generate a vector that represents the document numerically.
- vector_creation: Selects the schema for creating the word vector.
- add_meta_information: If checked, available meta information about the text, such as filename and date, is added as attributes.
- keep_text: If checked, the input text is stored as a special String attribute with the role text.
- prune_method: Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
- prune_below_percent: Ignore words that appear in less than this percentage of all documents.
- prune_above_percent: Ignore words that appear in more than this percentage of all documents.
- prune_below_absolute: Ignore words that appear in fewer than this many documents.
- prune_above_absolute: Ignore words that appear in more than this many documents.
- prune_below_rank: Words are ordered by frequency; words whose frequency is lower than that of the word at the rank given by this percentage are pruned.
- prune_above_rank: Words are ordered by frequency; words whose frequency is higher than that of the word at the rank given by this percentage are pruned.
- datamanagement: Determines how the data is represented internally.
- parallelize_vector_creation: Determines whether vector creation is executed in parallel.
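The percentage-based pruning parameters can be read as filters on document frequency, that is, on the share of documents in which a word appears. The sketch below is a plain Python illustration under stated assumptions: the helper `prune_by_document_frequency` is hypothetical, documents are given as lists of tokens, and the inclusive boundary handling is a choice made for the example rather than documented RapidMiner behavior.

```python
def document_frequencies(documents):
    """Count, for each word, the number of documents that contain it."""
    df = {}
    for tokens in documents:
        for word in set(tokens):  # count each word at most once per document
            df[word] = df.get(word, 0) + 1
    return df


def prune_by_document_frequency(documents, below_percent=0.0, above_percent=100.0):
    """Keep words whose document share lies within [below_percent, above_percent].

    Hypothetical helper mirroring the idea of prune_below_percent and
    prune_above_percent; inclusive boundaries are an assumption.
    """
    df = document_frequencies(documents)
    n = len(documents)
    kept = []
    for word, count in df.items():
        share = 100.0 * count / n  # percentage of documents containing the word
        if below_percent <= share <= above_percent:
            kept.append(word)
    return sorted(kept)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "cat", "flew"],
]
kept_words = prune_by_document_frequency(docs, below_percent=30.0, above_percent=90.0)
print(kept_words)
```

In this toy corpus, "the" appears in every document and "dog" and "bird" in only one each, so all three fall outside the 30 to 90 percent band. The absolute variants follow the same pattern, with `count` compared directly against the thresholds instead of `share`.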